Pragmatically annotated corpora in speech-to-speech translation

Author

  • Darinka Verdonik
Abstract

The aim of this paper is to discuss and specify some pragmatic language categories that could be used as attributes in spontaneous speech corpora, especially corpora used for developing components of speech-to-speech translation systems. When developing speech-to-speech translation, researchers have to deal with spontaneous (conversational) speech phenomena such as hesitations, turn-taking behaviors, self-repairs, false starts, and filled pauses. These phenomena make speech-to-speech translation a very hard task, with much room for improvement. Language technologies use linguistically annotated corpora and lexica (morphological, syntactic, semantic, and so on) to achieve better performance. In this paper I suggest including pragmatic annotation attributes to deal with some of the above-mentioned phenomena of spontaneous speech.

Slovenian title: Pragmatično označeni korpusi v strojnem simultanem prevajanju govora. The source also gives the abstract in Slovenian; it restates the English abstract above.

Similar resources

Morphologically and Syntactically Annotated Corpora of Many Languages

Annotated corpora have become a standard resource for research in both linguistics and computational processing of natural languages. Lexicographers judge word usage and distribution by occurrences in corpora; part-of-speech tags may help them narrow their queries. Grammarians may use syntactically annotated corpora (treebanks) for queries such as “show me all examples where a verb governs two ...

Categorizing expressive speech acts in the pragmatically annotated SPICE Ireland corpus

Expressive speech acts are one of the five basic categories of speech acts identified by Searle (1976). Expressives remain underresearched, though select categories of expressive speech acts, especially offering thanks and compliments, have received more extensive attention. An overall classification of expressive speech acts on the basis of corpus data has not yet been carried out. The current...

Exploring EFL Learners’ Use of Formulaic Sequences in Pragmatically Focused Role-play Tasks

Communicative language use largely entails regular patterns consisting of pre-constructed phrases or sequences. These sequences have been examined by many researchers to find the situation-based formulas which may help L2 learners follow a possibly more target-like speaking system. This study, therefore, explored two categories of formulaic expressions including speech formulas and situation-bo...

Recent Advances of Speech Databases Development Activity for Indian Languages

The development of speech corpora and acoustic-phonetic databases is indispensable for any research and development work in spoken language systems. Systematic efforts have been made to create speech databases for some major languages of India. The paper attempts to present the status and the recent advancements made in corpora development for some of the Indian languages. Different types of data...

Developing Morphologically Annotated Corpora for Minority Languages of Russia

Despite recent progress in developing annotated corpora for minority languages of Russia, still only about a dozen out of about 100 have comprehensive corpora, and even fewer have computational tools such as machine translation systems or speech recognition modules. However, given that many of them have resources such as dictionaries and grammars, the situation can be improved at relatively low ...

Publication date: 2006